A New Parallelization Method for K-means
Abstract
K-means is a popular clustering method in data mining. To handle large datasets, researchers proposed PKMeans, a parallel k-means on MapReduce [3]. However, existing k-means parallelization methods, including PKMeans, have several limitations. PKMeans cannot finish all of its iterations in one MapReduce job, so it must run a chain of MapReduce jobs in a loop until convergence. On Hadoop, the most popular MapReduce platform, every MapReduce job incurs significant I/O overhead and extra execution time during job start-up and shuffling [2]. Even worse, it has been proved that in the worst case k-means needs $2^{\Omega(n)}$ MapReduce jobs to converge [4, 5], where n is the number of data instances, which means huge overhead for large datasets. Additionally, in PKMeans at most one reducer can be assigned to update each centroid, so PKMeans can use only a limited number of parallel reducers. In this paper, we propose an improved parallel method for k-means, IPKMeans. It has a parallel preprocessing stage that uses a k-d tree [8], finishes k-means in a single MapReduce job with many more reducers working in parallel and lower I/O overhead than PKMeans, and has a fast post-processing stage that generates the final result. In our method, both the k-d tree and the new parallel k-means are implemented using MapReduce and tested on Hadoop. Our experiments show that, with the same dataset and initial centroids, our method has up to 2/3 lower I/O overhead and takes less time than PKMeans to reach a very close clustering result.
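To make the criticized baseline concrete, the following is a minimal in-memory sketch of the iterative scheme the abstract attributes to PKMeans: one simulated MapReduce job per k-means iteration, repeated until the centroids stop moving. It is not IPKMeans and not the authors' code. The function names (`map_assign`, `reduce_update`, `run_job`, `iterative_kmeans`), the tolerance, and the job cap are all hypothetical, and a plain dictionary stands in for the Hadoop shuffle.

```python
from collections import defaultdict
from math import dist  # Python 3.8+


def map_assign(point, centroids):
    """Map step (assumed form): emit (nearest centroid index, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_update(values):
    """Reduce step (assumed form): average the points assigned to one centroid."""
    dim = len(values[0][0])
    sums = [0.0] * dim
    count = 0
    for point, n in values:
        for d in range(dim):
            sums[d] += point[d]
        count += n
    return tuple(s / count for s in sums)


def run_job(points, centroids):
    """One simulated MapReduce job, i.e. one k-means iteration."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)
    # One reduce call per centroid key; an empty cluster keeps its old centroid.
    return [
        reduce_update(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def iterative_kmeans(points, centroids, tol=1e-9, max_jobs=100):
    """Chain jobs until no centroid moves more than tol; return (centroids, jobs)."""
    jobs = 0
    while jobs < max_jobs:
        new_centroids = run_job(points, centroids)
        jobs += 1
        shift = max(dist(a, b) for a, b in zip(new_centroids, centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, jobs


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
    print(iterative_kmeans(pts, [(0, 0), (10, 0)]))
```

Each pass through the `while` loop corresponds to a full job launch on a real cluster, which is where the start-up and shuffle costs described above would be paid again. The returned job count makes that cost visible even in this toy setting.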
Similar resources
MPI- and CUDA- implementations of modal finite difference method for P-SV wave propagation modeling
Among different discretization approaches, Finite Difference Method (FDM) is widely used for acoustic and elastic full-waveform modeling. An inevitable deficit of the technique, however, is its severe demand for computational resources. A promising solution is parallelization, where the problem is broken into several segments, and the calculations are distributed over different processors. ...
Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis
Recognizing genes with distinctive expression levels can help in prevention, diagnosis and treatment of the diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering the gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...
Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce
The kernel k-means is an effective method for data clustering which extends the commonly-used k-means algorithm to work on a similarity matrix over complex data structures. It is, however, computationally very complex as it requires the complete kernel matrix to be calculated and stored. Further, its kernelized nature hinders the parallelization of its computations on modern scalable infrastruc...
A hybrid DEA-based K-means and invasive weed optimization for facility location problem
In this paper, instead of the classical approach to the multi-criteria location selection problem, a new approach was presented based on selecting a portfolio of locations. First, the indices affecting the selection of maintenance stations were collected. The K-means model was used for clustering the maintenance stations. The optimal number of clusters was calculated through the Silhou...
Scalable Embeddings for Kernel Clustering on MapReduce
There is an increasing demand from businesses and industries to make the best use of their data. Clustering is a powerful tool for discovering natural groupings in data. The k-means algorithm is the most commonly-used data clustering method, having gained popularity for its effectiveness on various data sets and ease of implementation on different computing architectures. It assumes, however, t...
Journal title:
- CoRR
Volume: abs/1608.06347
Publication date: 2016